Active Cleaning for Video Corpus Annotation
Abstract
This paper describes the active cleaning approach used to complement active learning in the TRECVID collaborative annotation. The approach uses a classification system to select the most informative samples for multiple annotation, in order to improve the quality and reliability of the annotations. We evaluated its actual impact on the TRECVID 2007 collection, using complete annotations collected from different sources, including the TRECVID collaborative annotations and the MCG-ICT-CAS annotations. In our experiments, annotation quality improved significantly when applying the cleaning-by-cross-validation strategy, which selects the samples to be re-annotated. Higher performance can be reached with a double annotation of 10% of the negative samples, or of 5% of all annotated samples, selected by the proposed cross-validation cleaning strategy. With an appropriate strategy, using a small fraction of the annotations for cleaning improves system performance much more than using the same fraction to add more annotations.
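The selection step of cleaning by cross-validation can be illustrated with a minimal sketch. It is not the authors' system: the nearest-centroid classifier is a stand-in chosen only to keep the example dependency-free, and the function name `select_for_reannotation`, the parameters `fraction` and `n_folds`, and the binary 0/1 label encoding are all assumptions made for illustration. Each sample is scored by a classifier trained on the other folds, and the samples whose current label most contradicts that out-of-fold prediction are returned as candidates for a second annotation.

```python
from typing import List, Sequence


def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_for_reannotation(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    fraction: float = 0.05,
    n_folds: int = 5,
) -> List[int]:
    """Return indices of the samples whose current label most disagrees
    with an out-of-fold prediction (hypothetical helper, labels are 0 or 1)."""
    n = len(features)
    scores = [0.0] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        pos = [features[i] for i in train if labels[i] == 1]
        neg = [features[i] for i in train if labels[i] == 0]
        if not pos or not neg:
            continue  # cannot train a two-class model on this split
        c_pos, c_neg = centroid(pos), centroid(neg)
        for i in held_out:
            # margin > 0: the stand-in classifier leans towards "positive"
            margin = sq_dist(features[i], c_neg) - sq_dist(features[i], c_pos)
            # disagreement is large when the prediction contradicts the label
            scores[i] = -margin if labels[i] == 1 else margin
    budget = max(1, round(fraction * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

Calling `select_for_reannotation(features, labels, fraction=0.05)` spends the re-annotation budget on 5% of all annotated samples; restricting `features` and the returned indices to the currently negative samples and using `fraction=0.10` would correspond to the other setting mentioned in the abstract.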
Similar resources
Annotation of Clinical Narratives in Bulgarian language
In this paper we describe the annotation of clinical texts with morphosyntactic and semantic information. The corpus contains 1,300 discharge letters in Bulgarian for patients with endocrine and metabolic disorders. The annotated corpus will be used as a gold standard for evaluating information extraction on a test corpus of 6,200 discharge letters. The annotation is performed wi...
Indexation sémantique des images et des vidéos par apprentissage actif. (Semantic indexing of images and videos by active learning)
The general framework of this thesis is semantic indexing and information retrieval, applied to multimedia documents. More specifically, we are interested in the semantic indexing of concepts in images and videos using the active learning approaches with which we build annotated corpora. Throughout this thesis, we have shown that the main difficulties of this task are often related, in general, t...
Introducing the Reference Corpus of Contemporary Portuguese Online
We present our work in processing the Reference Corpus of Contemporary Portuguese and its publication online. After discussing how the corpus was built and our choice of meta-data, we turn to the processes and tools involved for the cleaning, preparation and annotation to make the corpus suitable for linguistic inquiries. The Web platform is described, and we show examples of linguistic resourc...
7x1-PT: um Corpus extraído do Twitter para Análise de Sentimentos em Língua Portuguesa (7x1-PT: a Corpus extracted from Twitter for Sentiment Analysis in Portuguese Language)
This paper describes the 7x1-PT corpus, a set of tweets in Portuguese posted during the Germany vs Brazil match at the FIFA World Cup 2014. We describe data collection, cleaning and organization, as well as the current stage of the linguistic annotation of this corpus.
A Large Portuguese Corpus On-Line: Cleaning and Preprocessing
We present a newly available on-line resource for Portuguese: a corpus of 310 million words, a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. Here we report on work carried out on the corpus prior to its publication on-line. We focus on the processes and tools involved for the cleaning, preparation and annotation to make the ...